Coursework for Principles of Data Science

Title : An Assessment of the UK Economy through Payment Transactions

Index for the Notebook

Section Guide

1.0 Dataset Source : Description of the sources for the datasets used

2.0 Data Preparation : Source file upload to pandas DataFrame, cleaning, labelling, data type conversion, date formatting, value imputation, calculation of absolute values, and merging of datasets

3.1 Data Derivation Part I - Payment Transactions : Consolidation of multiple dataframes into four main dataframes for High Value, Low Value, Cheque and Credit Card Payments

3.2 Data Derivation Part II - GDP : Consolidation of multiple sources into one and conversion of different date formats into a consistent quarterly format

4.1 Exploratory Data Analysis - Payments Data : Timeseries, scatterplots and boxplots used to check data quality and characteristics

4.2 Exploratory Data Analysis - GDP Data : Timeseries, scatterplots and boxplots used to check data quality and characteristics

5.0 Construction of Model : Feature selection, feature engineering, correlation and Variance Inflation analysis, review of model options, model selection

5.1 Simple Linear Regression Model and Evaluation : OLS linear regression, model summary, Harvey-Collier linearity test, residual histogram plot

5.2 Generalised Least Squares Method : An alternative approach to evaluate the performance of the OLS model. GLS is expected to address the OLS limitations around autocorrelation with lags in the predictor variable

5.3 Multiple Linear Regression : An alternative approach to evaluate the performance of the OLS model. MLR is applied to check whether it is a better approach than OLS Simple Linear Regression, or whether multicollinearity impacts the robustness and performance of the model

6.0 Model Validation : Additional methods such as sklearn Linear Regression with K-Fold Cross Validation and polynomial fit modelling used to assess the robustness and performance of the OLS Simple Linear Regression model, including a review of overfitting and coefficient variance across samples

6.1 SK Learn Linear Regression and K-Fold Cross Validation : Comparison of model outputs from a different linear regression method, and the impact of sample size on model robustness using the K-Fold Cross Validation technique

6.2 Polynomial Fit Regression Model : Polynomial fit regression used with various orders to assess any quadratic, cubic or higher-order relationship between the predictor variables and the target variable

1.0 Dataset Source

The analysis is conducted using UK payment systems and GDP data available in the public domain via the links below. The dataset is split into three types - Bank Account Transactions, Credit Card Transactions and GDP

A. Bank Account Payment Transactions : A monthly timeseries for the UK is available under https://www.wearepay.uk/factsandfigures/ in a single file (filename : Historical-Monthly-Payment-Statistics-1990-to-Sep-2020) split into multiple tabs, as follows

  1. BACS : Monthly timeseries available from Jan 1990 to Sep 2020; includes three different transaction types (Standing Orders, Direct Debit, Direct Credit) for volume, and two features for value (All Credits = Standing Orders plus Direct Credit, and Direct Debits)

  2. CHAPS : Monthly timeseries available from Jan 1990 to Sep 2020; includes two main features for the value and volume of GBP payments. A few records are also available for payments made in Euro from Jan 1998 to May 2008. These records were excluded from the analysis, as the dataset is not consistently available and is not deemed to have a significant impact on the analysis

  3. MonthlyCheque : Monthly timeseries for Physical Cheques is available from Jan 1990 until Aug 2019, as cheque processing was migrated to electronic imaging from 2019 onwards. Data is also available split between London and Scotland

  4. Credit : Giro payments data is available from Jan 1990 until Aug 2019, as processing was migrated to electronic imaging from 2019 onwards

  5. Faster Payments : Faster Payments was introduced in May 2008, and a monthly timeseries is available since then containing four transaction volume and value related fields. Three of the fields were combined to obtain consolidated volume and value; the Return Payments value and volume were ignored for this analysis, as their impact was less than 0.5%

  6. Image Clearing : Live since 2019, so a monthly timeseries is available from that date. Historic Physical Cheque and Giro transactions are now expected to be settled through this clearing

B. Credit Card Payments : Data available under https://www.ukfinance.org.uk/data-and-research/data/cards/card-spending

  1. Monthly timeseries data is available from Jan 1994 to Sep 2020 (and ongoing). The file has a number of fields related to transaction value, volume, outstanding credit, number of active cards, etc.

C. GDP - two datasets have been used as sources of UK GDP data. Both are quarterly timeseries

  1. https://docs.google.com/spreadsheets/d/1_rJJuTPWB-J8jkRZoeRi3De2tMLBxVcPKN2D-89lkUk/edit#gid=10 (through https://www.theguardian.com/news/datablog/2009/nov/25/gdp-uk-1948-growth-economy#data) with GDP presented as both Nominal and Inflation Adjusted, sourced from the ONS. Available from 1955 to 2013

  2. https://fred.stlouisfed.org/series/RGDPMPUKQ with inflation adjusted GDP available from 1955 to 2016

2.0 Data Preparation

A number of steps are executed to prepare the datasets for the Exploratory Data Analysis. This ensures that the datasets are created from the source files after ignoring header and footer rows containing commentary/narrative, have a consistent date format across the timeseries (the source files do not contain the year/month combination on every row, which prevents an automatic upload), and have manually assigned column headers (the source files spread the column headers across more than one row). The steps include

  1. Upload of the source files into a pandas dataframe
  2. Deletion of header and footer rows during upload, as a number of records contain titles and descriptions not required for the analysis
  3. Derivation of the datetime for the timeseries by imputing the year value for missing rows
  4. Forward fill used to fill NaN values in the year column, as the source file only states the year once per year rather than against each of the 12 months
  5. Assignment of absolute values to transaction volume (multiplied by 1,000) and amount (multiplied by 1,000,000)
  6. GDP period date changed from beginning of quarter to end of quarter to match the end-of-month dates of the Payments data
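The forward-fill, date derivation and unit-conversion steps above can be sketched in pandas. This is a minimal illustration on a toy frame; the column names (`Year`, `Month`, `Volume`, `Value`) are assumptions, not the actual source headers:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the source layout: the year appears only once per
# year (against January), the remaining months are blank
raw = pd.DataFrame({
    "Year":   [1990, np.nan, np.nan, 1991, np.nan, np.nan],
    "Month":  ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "Volume": [10, 12, 11, 13, 14, 15],        # reported in thousands
    "Value":  [1.0, 1.2, 1.1, 1.3, 1.4, 1.5],  # reported in GBP millions
})

# Step 4: forward-fill the sparsely populated year column
raw["Year"] = raw["Year"].ffill().astype(int)

# Step 3: derive a month-end datetime for the timeseries
raw["Date"] = (pd.to_datetime(raw["Year"].astype(str) + "-" + raw["Month"],
                              format="%Y-%b")
               + pd.offsets.MonthEnd(0))

# Step 5: convert reported units to absolute values
raw["Volume"] *= 1_000
raw["Value"] *= 1_000_000
```

`pd.offsets.MonthEnd(0)` rolls each first-of-month date forward to the month end, matching the convention used for the Payments data.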

3.1 Data Derivation Part I - Payment Transactions

Progressive outer joins are used on all types of Payments dataframes to create one unified payment dataframe combining all types of Bank Account Payments (CHAPS, BACS, FPS, Cheques). The resulting dataset was inner joined with Credit Card Payments to create a unified Payments dataset. An inner join was used because Credit Card data, unlike the Bank Account Payments data, is not available prior to 1994. The unified dataframe was then condensed to a reduced number of features to facilitate the exploratory data analysis for evaluating the modelling steps.

Faster Payments was mainly introduced to facilitate same-day transactions, with the longer-term view of replacing BACS, so the BACS and FPS dataframes have been combined into overall Low Value Payments volume and value. New fields were created for Transaction Value and Transaction Volume combining the three types of BACS and three types of Faster Payments transactions. As Faster Payments has only been available since 2008, values prior to that date were taken as zero when combining the dataframes

Similarly, the Physical Cheques, Giro and Digital Cheques dataframes were merged to create one dataframe with the total value and volume of Cheque transactions. As the Physical Cheques and Giro data is only available until 2019 and Digital Cheques only since 2019, an empty set of records with zero values was created for the 2019-2020 period before merging with the Digital Cheques data

The dataframes were then incrementally combined (e.g. BACS with FPS, BACS+FPS with CHAPS, BACS+FPS+CHAPS with Cheques and Credit Card) to create a unified dataframe for all types of Payments with new aggregated Volume and Value fields to reduce the number of features across individual dataframes.
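The outer-then-inner join pattern described above can be sketched as follows. The frames, dates and column names here are illustrative stand-ins, not the actual datasets:

```python
import pandas as pd

# Hypothetical monthly frames (dates and values are illustrative)
bacs = pd.DataFrame({"Date": pd.to_datetime(["1990-01-31", "1990-02-28", "1990-03-31"]),
                     "BACS_Value": [1.0, 1.1, 1.2]})
fps = pd.DataFrame({"Date": pd.to_datetime(["1990-02-28", "1990-03-31"]),
                    "FPS_Value": [0.5, 0.6]})
cards = pd.DataFrame({"Date": pd.to_datetime(["1990-02-28", "1990-03-31"]),
                      "Card_Value": [0.2, 0.3]})

# Outer join keeps the full Bank Account history; months before FPS went
# live are filled with zero, as described above
low_value = bacs.merge(fps, on="Date", how="outer").fillna({"FPS_Value": 0})
low_value["LowValue_Value"] = low_value["BACS_Value"] + low_value["FPS_Value"]

# Inner join trims the unified series to the period with Credit Card data
unified = low_value.merge(cards, on="Date", how="inner")
```

The same pattern extends incrementally to CHAPS and Cheques before the final inner join with Credit Card Payments.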

4.1 Exploratory Data Analysis - Payments Data

Timeseries, Scatterplot and Boxplots used to check Data Quality and Characteristics

A scatter plot is used as a visual analytics method for both Transaction Volume and Transaction Value, to assess data quality as well as to understand the relationship between the continuous variables. Initially the scatter plot showed anomalies, and on investigation issues were identified with the source data quality: the BACS source file had a number formatting mask for only a certain range of values, and the Cheques data was spread across different columns over different periods. Both issues were corrected and the source files were re-uploaded.

Scatterplots with the corrected source data exhibit a positive correlation between High Value, Low Value and Credit Card Payments, and a negative correlation with Cheque Payments. This validates the declining usage of cheques in the UK over time as more payments are made electronically and by card. The data is therefore considered reliable, as it reflects known trends in the payments industry.

Yearly and monthly timeseries plots and boxplots are used to visually analyse the Transaction Volume and Value across High Value, Low Value, Cheque and Credit Card Payments. The yearly distribution shows more outliers than the monthly distributions.

Visually, the yearly boxplots indicate a trend, i.e. a non-stationary timeseries in which the mean and variance change over time. The monthly boxplots show seasonality

A summary statistical description of the Payment Transactions volume and value dataframe is not very useful for understanding the characteristics of the timeseries data, as the mean and variance of a timeseries should be analysed over time to distinguish stationary from non-stationary behaviour.
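The stationarity point can be illustrated with rolling statistics on a synthetic trending series (the data below is illustrative, not the actual payments series):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic upward-trending series standing in for a payments timeseries
series = pd.Series(np.arange(120) * 0.5 + rng.normal(scale=2.0, size=120))

# For a stationary series these would hover around constant levels;
# a drifting rolling mean indicates a trend, i.e. non-stationarity
rolling_mean = series.rolling(window=12).mean()
rolling_var = series.rolling(window=12).var()
```

A single overall mean and variance (as reported by `describe()`) hides exactly this drift, which is why the rolling view is preferred for timeseries.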

3.2 Data Derivation Part II - GDP

A single dataset was created to combine the GDP information available from the two different sources. The quarterly series from these two sources use two different date formats (YYYY-Qq vs DD/MM/YYYY). The date format has been standardised and converted to end of period/quarter for consistency
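A minimal pandas sketch of this standardisation, using illustrative sample values for the two formats:

```python
import pandas as pd

# The two sources use different quarterly date formats; both are converted
# to end-of-quarter timestamps (the sample values are illustrative)
guardian = pd.Series(["1955-Q1", "1955-Q2"])    # YYYY-Qq format
fred = pd.Series(["01/01/1955", "01/04/1955"])  # DD/MM/YYYY, quarter start

q_guardian = pd.PeriodIndex(guardian, freq="Q").to_timestamp(how="end").normalize()
q_fred = (pd.to_datetime(fred, format="%d/%m/%Y")
            .dt.to_period("Q")
            .dt.to_timestamp(how="end")
            .dt.normalize())
```

Routing both formats through a quarterly `Period` gives a single end-of-quarter convention regardless of the source representation.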

4.2 Exploratory Data Analysis - GDP Data

A timeseries plot is analysed alongside year and quarter boxplots for the Nominal and Inflation Adjusted GDP. The timeseries plot indicates that the mean is trending up, while the boxplots show a right-skewed distribution for the Inflation Adjusted GDP from the ONS and FRED sources; the Nominal GDP appears to have a better alignment to a normal distribution

5.0 Construction of Model - Selection, Alternatives, Comparison and Evaluation

Correlation analysis is performed on the unified dataset created by merging the target and predictor variable datasets. A very high degree of positive correlation is observed between the electronic payment methods (High Value, Low Value and Card Payments), and a negative correlation is observed between Cheque Payments and the electronic payment methods. The Variance Inflation Factor is used to identify multicollinearity issues for Multiple Linear Regression modelling

New features (Average Transaction Value = Value of Payments / Volume of Payments) are created as predictor variables to combine the impact of transactionality in the economy with the magnitude of the transactions.

PairGrid plots were used on these new feature variables and the two types of target variables (Nominal GDP and Inflation Adjusted GDP) to visually assess the distribution of individual parameters and the correlation of the feature variables amongst themselves as well as with the target variable. Whilst a clear relationship can be identified between the target variable and some of the feature variables (Average Credit Card Transaction Amount to GDP, Average Low Value Transaction Amount to GDP), a non-monotonic relationship is highlighted between GDP and Average High Value Transaction Amount

Regression and residual plots are used to identify the best feature variable for the regression model. Patterns are observed in the residual plots of each feature variable against the target variable. These patterns are less pronounced for GDP vs Average High Value Transaction Amount compared to GDP vs Average Credit Card and GDP vs Average Low Value Transaction Amount.

Visual methods are applied to both the Nominal GDP and Inflation Adjusted GDP targets to assess major differences

5.1 Simple Linear Regression Models

The Ordinary Least Squares (OLS) linear regression method from statsmodels is used for modelling GDP with Low Value Payments as the predictor variable. A comparison is made with Simple Linear Regression models using the Average Payment Amount features of High Value and Credit Card Payments.

Each model run has a model summary and residual plot to review the outcome and evaluate model characteristics.

Model evaluation is done separately using the Harvey-Collier test and a histogram plot of residuals

5.1.1 Simple Linear Regression Model - Average Low Value Transaction Amount

5.1.1 Model Evaluation

Linearity : Harvey-Collier Test

To test whether the feature and target variables have the linear relationship required for Simple Linear Regression

Predicted vs Actual values plot

To check for overfitting or underfitting issues

Histogram of Residuals

To check the normality and variance of the residual distribution

5.2 Generalized Least Square Method

An alternative approach to OLS, used to review the robustness of the timeseries modelling by addressing autocorrelation in the predictor and target variables

5.2.1 GLS Statsmodel for Low Value Payments

Model run with summary and residual plot to compare against the corresponding OLS Simple Linear Regression results

5.1.2 Simple Linear Regression Model - Average Credit Card Transaction Amount

5.2.2 GLS for Credit Card Payments

Model run with summary and residual plot to compare against the corresponding OLS Simple Linear Regression results

5.1.3 Simple Linear Regression Model - Average High Value Transaction Amount

5.2.3 GLS Regression for Average High Value Payment

Model run with summary and residual plot to compare against the corresponding OLS Simple Linear Regression results

5.3 Multiple Linear Regression Model

Evaluation of the OLS Simple Regression approach using different combinations of feature variables in a Multiple Linear Regression model

5.3.1 MLR with Average Low Value , Credit Card and High Value Transaction Amount

Modelling based on all three feature/predictor variables, with assessment of the summary and residuals to compare robustness and residual distribution against the OLS Simple Linear Regression outputs

5.3.2 MLR with Average Low Value and High Value Transaction Amount

Modelling based on the two feature/predictor variables, with assessment of the summary and residuals to compare robustness and residual distribution against the OLS Simple Linear Regression outputs

5.3.3 MLR with Average Credit Card and High Value Transaction Amount

Modelling based on the two feature/predictor variables, with assessment of the summary and residuals to compare robustness and residual distribution against the OLS Simple Linear Regression outputs

6.0 Model Validation

In addition to the model evaluation techniques of the prior section (study of linearity, coefficients and residuals from OLS Simple Linear Regression, and comparison with the Generalised Least Squares and Multiple Linear Regression model summaries), additional methods such as sklearn Linear Regression with K-Fold Cross Validation and polynomial fit modelling are used to assess the robustness and performance of the OLS Simple Linear Regression model

6.1 Simple Linear Regression - K-Fold Validation

Use of an alternative linear regression method to compare the coefficients with those from OLS Simple Linear Regression

Use of K-Fold Cross Validation to check for overfitting/underfitting issues, as well as the variance of the coefficients across random samples
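A sklearn sketch of this validation step, again on illustrative synthetic data rather than the actual series:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Illustrative single-predictor data standing in for the payments feature
X = rng.uniform(100, 500, size=(120, 1))
y = 2.0 * X.ravel() + rng.normal(scale=20.0, size=120)

model = LinearRegression()

# 5-fold cross validation: a wide spread of fold R^2 scores would signal
# an unstable or overfitted model
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

model.fit(X, y)  # coefficient comparable with the statsmodels OLS beta
```

Consistent fold scores plus a coefficient close to the statsmodels OLS estimate support the robustness of the simple model.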

6.2 Polynomial Fitting for single variable

Polynomial fits are used to check whether a second- or higher-order fit addresses the residual distribution issue identified in the OLS Simple Linear Regression modelling. However, the results do not indicate a non-linear relationship between the predictor and target variables

6.3 Polynomial Fitting for multiple predictor variable

Polynomial fits are used on multiple predictor variables (Low Value, High Value and Credit Card Payments) to check whether a second- or higher-order fit addresses the residual distribution issue identified in the OLS Simple Linear Regression modelling. However, the results do not indicate a non-linear relationship between the predictor and target variables
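The multi-predictor polynomial check can be sketched with a sklearn pipeline. The data is an illustrative stand-in with a purely linear relationship, so higher degrees should add essentially nothing:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Two illustrative predictor series with a purely linear relationship to y
X = rng.uniform(100, 500, size=(120, 2))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=20.0, size=120)

# Fit degree 1 (linear), 2 (quadratic) and 3 (cubic) models; PolynomialFeatures
# expands the inputs with interaction and power terms before the regression
fits = {d: make_pipeline(PolynomialFeatures(degree=d), LinearRegression()).fit(X, y)
        for d in (1, 2, 3)}
r2 = {d: m.score(X, y) for d, m in fits.items()}
# Near-identical R^2 across degrees suggests no higher-order relationship
```

A negligible R-squared gain from degree 1 to degree 3 is consistent with the conclusion above that the relationship is not non-linear.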

References

Data Sources

Coding inspiration

Timeseries https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781789346466/1/ch01lvl1sec12/plotting-a-time-series-chart

Interpreting Skewness and Kurtosis

https://medium.com/@atanudan/kurtosis-skew-function-in-pandas-aa63d72e20de

Peter H. Westfall (2014) Kurtosis as Peakedness, 1905–2014. R.I.P., The American Statistician, 68:3, 191-195, DOI: 10.1080/00031305.2014.917055

https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa

Smoothing https://www.statsmodels.org/stable/examples/notebooks/generated/exponential_smoothing.html


https://machinelearningmastery.com/time-series-seasonality-with-python/

https://towardsdatascience.com/holt-winters-exponential-smoothing-d703072c0572

https://www.kaggle.com/kashnitsky/topic-9-part-1-time-series-analysis-in-python

https://medium.com/datadriveninvestor/how-to-build-exponential-smoothing-models-using-python-simple-exponential-smoothing-holt-and-da371189e1a1

Time Series Decomposition - Level, Trend, Seasonality https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/#:~:text=Trend%3A%20The%20increasing%20or%20decreasing,random%20variation%20in%20the%20series.

Model Evaluation https://towardsdatascience.com/perform-regression-diagnostics-and-tackle-uncertainties-of-linear-models-1372a03b1f56

Bank of England, Real Gross Domestic Product at Market Prices in the United Kingdom [RGDPMPUKQ], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/RGDPMPUKQ, December 9, 2020.

GLSAR Regression https://www.statsmodels.org/stable/examples/notebooks/generated/gls.html

Samuel H. Williamson, "What Was the U.K. GDP Then?", MeasuringWorth, 2020. URL: http://www.measuringworth.com/ukgdp/

Stephanie Glen. "Variance Inflation Factor" From StatisticsHowTo.com: Elementary Statistics for the rest of us! https://www.statisticshowto.com/variance-inflation-factor/

Capstone/Modeling Part 1. Univariate Time Series Analysis..ipynb at master · AndreaYoss/Capstone (github.com)

Capstone/Modeling Part 2. Multivariate Time Series Analysis..ipynb at master · AndreaYoss/Capstone (github.com)